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Abstract 

We investigate the folding mechanism of the WW domain Fip35 using a realistic atomistic force field 
by applying the Dominant Reaction Pathways (DRP) approach. We find evidence for the existence of two 
folding pathways, which differ by the order of formation of the two hairpins. This result is consistent with 
the analysis of the experimental data on the folding kinetics of WW domains and with the results obtained 
from large-scale molecular dynamics (MD) simulations of this system. Free-energy calculations performed 
in two coarse-grained models support the robustness of our results and suggest that the qualitative structure 
of the dominant paths are mostly shaped by the native interactions. Computing a folding trajectory in 
atomistic detail only required about one hour on 48 CPU's. The gain in computational efficiency opens the 
door to a systematic investigation of the folding pathways of a large number of globular proteins. 
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Unveiling the mechanism by which proteins fold into their native structure remains one of the 
fundamental open problems at the interface of contemporary molecular biology, biochemistry and 
biophysics. A critical point concerns the characterization of the ensemble of reactive trajectories 
connecting the denatured and native states, in configuration space. 

In this context, a fundamental question which has long been debated [[H is whether the folding 
of typical globular proteins involves a few dominant pathways, i.e. well defined and conserved 
sequences of secondary and tertiary contact formation, or if it can take place through a multitude 
of qualitatively different routes. A related important question concerns the role of non-native 
interactions in determining the structure of the folding pathways 

In principle, atomistic MD simulations provide a consistent framework to address these prob- 
lems from a theoretical perspective. However, due to their high computational cost, MD simu- 
lations can presently only be used to investigate the conformational dynamics of relatively small 
polypeptide chains, and are only able to cover time intervals much smaller than the folding times 
of typical globular proteins. 

In view of these limitations, a considerable amount of theoretical and experimental activity has 
been devoted to investigate the folding of protein sub-domains, which consist of only a few tens of 
aminoacids, and fold on sub-millisecond time-scales [4]. In particular, a number of mutants of the 
35 aminoacid WW domain of human protein pinl have been engineered which fold in few tens 
of microseconds Their small size and their ultra- fast kinetics make them ideal benchmark 
systems, for which numerical simulations can be compared with a large body of experimental 
data §m. 

In particular, a MD simulation was performed to investigate the dynamics of a mutant named 
Fip35 (see Fig. [T]), for a time interval longer than 10 /js. Unfortunately, in that simulation no 
folding transition was observed [El HI. 

The folding of this WW domain was later investigated by Pande and co-workers, using a world- 
wide distributed computing scheme IfTOl . According to this study the transition proceeds in a very 
heterogeneous way, i.e. through a multitude of qualitatively different and nearly equiprobable 
folding pathways. 

Noe and co-workers performed a Markov state model analysis of a large number of short 
(< 200 ns) nonequilibrium MD trajectories [11] performed on the WW domain of human Pin 
1 protein. They reported a complex network of transition pathways, which differ by the specific 
order in which the different local meta-stable states were visited. On the other hand, in all path- 
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ways the formation of hairpins takes place in a definite sequence (see e.g. Fig. |2]). In particular, 
from the statistical model it was inferred that in about 30% of the folding transitions, the second 
hairpin forms first, as in the right panel. 

A different conclusion has been reached by Shaw and co-workers, by analyzing a ms-long MD 
trajectory with multiple unfolding/refolding events, obtained using a special-purpose supercom- 
puter llT2ll . In that simulation the WW domain of Fip35 was found to fold and unfold predomi- 
nantly along a pathway in which hairpin 1 is fully structured, before hairpin 2 begins to fold, as 
shown in the left panel of Figj2j In a recent paper [ISOll . Krivov re-analyzed the same ms-long 
MD trajectory in order to identify an optimal set of reaction coordinates. His conclusion was that 
the folding of this WW domain is thermally activated rather than incipient downhill and that the 
transition also occurs through a second pathway, in which hairpin 2 forms before hairpin 1 . The 
statistical weights of the two pathways estimated from the number of folding events are 80% ± 20% 
and 20% ±10%. 

While all these theoretical studies yield folding times in rather good agreement with available 
experimental data on folding kinetics, they provide different pictures of the folding mechanism 
and raise a number of issues. 

Firstly, it is important to assess the degree of heterogeneity of the folding mechanism and to 
clarify whether the most statistically significant folding pathways are those in which the hairpins 
form in sequence. Important related questions are also whether the folding mechanism is corre- 
lated with the structure of the initial denatured conditions from which the reaction is initiated and 
with the temperature of the heat bath. Finally, it is interesting to address the problem of the relative 
role played by native and non-native interactions in determining the structure of folding pathways. 
Indeed, while native interactions are arguably shaping the dynamics in the vicinity of the native 
state, non-native interactions may in principle play an important role in the transition region and 
at the rate limiting stages of the reaction. 

In order to tackle these questions, in this work we use the Dominant Reaction Pathways (DRP) 
approach [[T4l - fT8l . a framework which allows to very efficiently compute the statistically most 
significant pathways connecting given denatured configurations to the native state at an atomistic 
level of detail, with realistic force fields. To further support our results and to study the role of 
native and non-native interactions we map the free energy landscape by performing Monte Carlo 
simulations in two coarse-grained models. 

Our study confirms that Fip35 folds mostly through the two folding channels discussed above 
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and shows that the relative weight of the two channels changes with temperature. In addition, we 
find that the folding pathways are correlated with the initial condition from which the transition is 
initiated. The studies based on the coarse-grained model suggest that the folding dynamics in the 
transition region is not significantly influenced by non-native interactions. 

I. METHODS 

A. Atomistic Force Field 

Our atomistic simulations of the dominant folding trajectories of the Fip35 WW domain were 
performed using the AMBER ff99SB force field [|T9l in implicit solvent with Generalized Bom 
formalism implemented in GROMACS 4.5.2 [|20ll . The Bom radii were calculated according to 
the Onufriev-Bashford-Case algorithm [|2T|. 

In a recent work based on the DRP method, the dominant pathway in the conformational tran- 
sition of tetra-alanine obtained using the same version of the AMBER force field was found to 
agree well with the results of an analogous calculation in which the molecular potential energy 
was determined ab initio, i.e. directly from quantum electronic structure calculations [22]. 

B. Coarse-grained Model 

To study the equilibrium properties of the folding of the Fip35 WW domain we used the coarse- 
grained model recently developed in Ref.s [|23ll24]| . In that model, aminoacids are represented by 
spherical beads centered at the Ca positions. The non-bonded part of the potential energy contains 
both native and non-native interactions. The former are the same used in the Go-type model of 
Ref. [25 J, while the latter consist of a quasi-chemical potential, which accounts for the statistical 
propensity of different aminoacids to be found in contact in native structures, and of a Debye- 
screened electrostatic term. In this model, the average potential energy due to native interactions in 
the folded phase is typically one order of magnitude larger than that due to non-native interactions. 
Above the folding temperature, this ratio drops to about 4. 

This model was shown to provide an accurate description of protein-protein complexes with 
low and intermediate binding affinities [1231 . In the insert of the first panel of Fig. [3] we plot the 
specific heat, evaluated from MC simulations at different temperatures, which indicates that this 
model yields the correct folding temperature for this WW domain. 
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C. The Dominant Reaction Pathways Method 



The high computational cost of MD simulations of macromolecular systems has triggered ef- 
forts towards developing alternative theoretical frameworks to investigate their long-time dynam- 
ics and reaction kinetics (see e.g. [fT4l |26] - I28l [30l - l32l and references therein). 

In particular, the DRP approach [[T4] - fT8l[33ll concerns physical systems which can be described 
by the overdamped Langevin equation. If denotes the coordinate of the ^— th atom, the Langevin 
equation in the so-called Ito Calculus reads: 



Xk{i+l) = Xk{i) - —^VU[X{i)] + ^2DkAty\k{i). (1) 
Kb! 

In this equation, X(/) = (xi (?'),..., xyv(/)) is the set of atomic coordinates at the th time step. A? is 
an elementary time interval, is the diffusion coefficient of the ^— th atom, kg is the Boltzmann's 
constant, T is the temperature of the heat-bath, t/(X) is the potential energy. V[^{i) is a white 
Gaussian noise with unitary variance, acting on the ^— th atom. 

The probability for a protein to fold in a given time interval t can be written as 

Pf{t) = I dXf h^iXf) I dX, hD{Xi) P{Xf,t\Xi) po(X,0, (2) 

where /j^(£))(X) is the characteristic function of the native (denatured) state (defined in terms of 
some suitable order parameters), Po(X/) is the initial distribution of micro-states in the denatured 
state and /'(Xy^,f|X,) is the conditional probability of reaching the (native) configuration Xf start- 
ing from the (denatured) configuration X,, in a time t. If the total time interval t is chosen much 
smaller than the inverse folding rate, this probability is dominated by single nonequihbrium fold- 
ing events. 

It can be shown that the probability of a given folding trajectory X{t) connecting denatured and 
native configurations is proportional to the negative exponent of the Onsager-Machlup functional 
[|3Tl[33ll . which in discretized form reads 

N, N 



Prob[X] <x exp 



(x\i+l)-x'{i) + ^VU[X{i)] 



(3) 



where Nt is the number of time steps in the trajectory. On the other hand, the paths which do not 
reach native state before time t do not contribute to the transition probability in Eq. |2j The most 



probable — or so-called dominant — reaction pathways are those which minimize the exponent in 
Eq. |3] In principle, these may be found by numerically relaxing the effective action functional [fT4l 

m 

(=1 

where Ve//(X) is the so-called effective potential, and reads 

VeffiX) = ^^^Y^Dk (|V,C/(X)p-2fcsrv2f/(X)). (5) 

In practice, for a protein folding transition, directly minimizing the effective action in Eq.|4]is 
unfeasible, since at least lO'^ — 10^ time steps are needed to describe a single folding event. On 
the other hand, for any fixed pair of native and denatured configurations the dominant paths can be 
equivalently found by minimizing an effective Hamilton- Jacobi (HJ) action in the form [fT4l [33l 

Shj = £ AZ,-,-+i (%/-fK//[X(0]), (6) 

where and A/,_|_i = A/(X(r+Ty^^^(OP represents the elementary displacement in configuration 
space, and for sake of clarity, we have assumed that all atoms have the same diffusion coefficient. 
The parameter Egff determines the time at which any given frame / of the path is visited: 

t{l) = £ AZ, ,+ 1 [mEeff + Veff[m] (7) 
i<l 

Hence, by adopting the HJ formulation of Eq. |6| it is possible to replace the time discretiza- 
tion with the discretization of the curvilinear abscissa /, which measures the Euclidean distance 
covered in configuration space during the reaction [f33l . This way, the problem of the decoupling 
of time scales is bypassed. As a result, only about 10^ frames are usually sufficient to provide a 
convergent representation of a trajectory. On the other hand, the HJ formalism requires to perform 
an optimization in the space of reactive pathways of a functional, which can take complex values, 
which is in general a complicated task. 

The DRP approach displays important similarities with the SDEL (Stochastic Difference Equa- 
tion in Length) method developed by Elber and co-workers BOll . In particular, while the DRP is 
based on minimizing the effective HJ action Sdrp =1/ / ^//-\/K//[X(/)] +Eeff, in SDEL the 
folding trajectories are obtained by extremizing the physical HJ action Ssdel = ! A/f/[X(/)] —E, 
where U {x) is the potential energy and E is the conserved total mechanical energy. 
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+ K//[X(/)] 



(4) 



The SDEL algorithm is presently implemented in the MOIL software for molecular model- 
ing [l39ll . In this code, the protein folding trajectories are obtained by means of simulating an- 
nealing relaxation, starting from an initial trial trajectory -see e.g. Ref. POl — . The DRP 
algorithm is now implemented in DOLOMIT (Dynamics Of muLti-scale mOdels of Molecular 
transITions), a program under development at the Interdisciplinary Laboratory for Computational 
Science (Trento). This code uses the MD-based algorithm described in the next section to identify 
the dominant path. 

D. Exploration of the Path Space 

The reliability of the DRP approach in investigating the protein folding transition crucially 
depends on the efficiency of the algorithm used to find optimum paths. In the analysis of confor- 
mational [fTSl l22l or chemical [[TSlI reactions of relatively small molecules, dominant paths can be 
found by directly optimizing the HJ action in Eq. |6} e.g. using simulated annealing or gradient- 
based methods. The DRP calculations for protein folding obtained this way have been extensively 
tested using reduced models in which the relevant degrees of freedom are individual aminoacids 
and the energy landscape was relatively smooth [|37l |38l . Unfortunately, when moving from a 
coarse-grained to an atomistic description, a large amount of frustration is introduced and the en- 
ergy surface of the model becomes much rougher. Consequently, the relaxation algorithms adopted 
in our previous calculations were found to provide a poor exploration of the space of folding paths 
in an all-atom calculation. 

In order to overcome this problem, we have used a biased MD algorithm to efficiently produce 
a large ensemble of paths, starting from a given denatured configuration and reaching the native 
state [|28l|4ll|42l. In particular, the so-called ratchet-and-pawl MD (rMD) algorithm [28] exploits 
the spontaneous fluctuations of the system along a specific collective coordinate (CC), towards its 
native configuration. This is done by introducing a time-dependent bias potential V/;(X,?), whose 
purpose is to make it very unlikely for the system to evolve back to previously visited values of 
the CC. On the other hand, this bias exerts no work on the system when it spontaneously proceeds 
towards the native state. We emphasize that this approach is quite different from the one used in 
steered-MD [|43|, where an external force is continuously applied to the system, in order to drive 
it towards the desired state. 

Following the work of Ref. [28] we chose a CC zit), which defines the distance between 
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the contact map in the instantaneous configuration X{t) from the contact map in the native con- 
figuration X"^''^*^. This way, the energy optimized native configuration has by definition z = 0. 
On the other hand, a bias on z does not force nor lock any specific contact, but only imposes a 
(quasi) monotonic behavior of the total number of native contacts. 

In particular, the biasing potential introduced in Ref. Il28ll is defined as 



v .V A ) for z[X(f)] > z^{t) 

0, for Z%{t)\<Zm{t). 

In these equations, Zmif) is the minimum value assumed by the collective variable z along the rMD 
trajectory, up to time t. 

The value of the collective variable z in the instantaneous configuration is defined as: 

z[X(0]=i;[C,,[X(0]-Q,(X-'-)]2. (9) 
'J 

The entries of the contact map C,y are chosen to interpolate smoothly between and 1, depending 
on the relative distance of the residues / and j: 



Q,(X) = (10) 

a 

where ro=7.5 A is a fixed reference distance. The variable Zm(t) is updated only when the system 
visits a configuration with a smaller value of the CC, i.e. any time z[X(? + ht)\ < Zm{t). 

The value of the spring constant Rr in the ratchet potential — see Eq. |8] — controls the amount 
of bias introduced by the ratchet algorithm. At very low values of k^, the rMD trajectories are 
minimally biased. However, in this limit the system performs the first folding transition on typical 
time intervals comparable with the folding time ? ~ 1/^/ — where kf h the folding rate — . On 
the other hand, the DRP method is based on the transition probability given Eq. |2} where the time 
interval t is of the order of the transition path time tTPT ^ 1/^/- Hence, DRP simulations based 
on ratchet simulations in which kR is chosen too small are computationally very expensive, since 
most paths do not reach the native state within the time interval t entering Eq. |2| hence do not 
contribute to the transition probability. 

In the opposite high kR limit, the bias force becomes comparable with the physical internal 
forces acting on the atoms and the dynamics is affected by a significant bias. In this regime, if the 
ratcheting coordinate is not optimal, the system is driven into relatively large free energy regions. 
The unbiased statistical weight given by Eq. |3]penalizes these trajectories. In the extremely high 
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kR limit, the bias towards forming native contacts is very large, and breaking a native contact which 
is present in a (partially) misfolded configuration becomes extremely unlikely. As a result, also 
rMD simulations in which is chosen too large have a low folding yield, since the trajectories 
have a large probability to be trapped in misfolded states. 

This algorithm allows to efficiently generate a large number of trajectories starting from the 
same configuration and reaching the native state, hence it can be used to explore the folding path 
space. Eq. |3]provides a rigorous way to score such trial trajectories, i.e. to evaluate the probability 
for each of them to be realized in an unbiased overdamped Langevin dynamics simulation. In 
particular, the best estimate for the dominant folding pathway is the one with the smallest Onsager- 
Machlup action. The path may then be used as a starting point for a further refinement based 
on a local relaxation of the HJ action given by Eq. |6} performed by means of the optimization 
algorithms described in our previous work (see e.g. Ref. Il22ll44ll ). 

The second refinement step is computationally very expensive, requiring several thousands of 
CPU hours for each dominant trajectory. However, by performing a number of test simulations, 
we have found that it produces only very small rearrangements of the chain, mostly filtering out 
small thermal fluctuations (see e.g. Figj4]). 

Some comments on the minimization procedure described above are in order: first, we empha- 
size that since the dynamics generated by the rMD algorithm has infinite memory, it violates the 
detailed-balance condition and the time intervals computed from the biased trajectories have no 
direct physical meaning. On the other hand, once a minimum of the HJ action has been identified, 
it is in principle possible to reconstruct the physical time at which each frame is visited, by means 
ofEq.|7} 

As long as one is concerned mostly with the global qualitative aspects of the folding mecha- 
nism, the expensive refinement session of the DRP calculation may be dropped. This allows us to 
reduce the total computational time required to perform the analysis by several orders of magni- 
tude. On the other hand, we are not in a condition to provide a rehable estimate of the time interval 
that the protein takes to complete a folding reaction. 

We stress that the present DRP approach is expected to give unreliable results if the sampling of 
the trial trajectories is too limited or if the CC adopted in the bias potential of the rMD algorithm 
was a bad reaction coordinate. A number of recent studies have shown that the total number of 
native contacts represents reasonable reaction coordinate for the folding of small globular proteins 

mmm. 
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TABLE I: Summary of the simulation details 



Unfold. MD steps 



Therm. MD steps 



Fold. rMD steps 



Denatured 



Rejected 



(T= 1600 K) 



(T= 300 K) 



(T= 300 K) 



initial conditions 



50 X 10 



100 X 10^ 



50 X 10^ 



44 



18 



"In all MD and rMD simulations the integration time step has been chosen to be 1 fs. 
E. Computational Procedures and Simulation Details 

The atomistic DRP calculations were performed using DOLOMIT. The code calls a librarized 
version of GROMACS 4.5.2 [20] to calculate the molecular potential energy and its gradient. 

We defined the native state as the set of conformations with a RMSD to the crystal structure of 
the C(x in the hairpins smaller than 3.5 A. A configuration was considered denatured if the RMSD 
to native of both hairpins was larger than 6A. The stability of the native state within the present 
force field was checked by running 12 unbiased 2 ns-long MD simulations at the room tempera- 
ture (300 K). In all such trajectories the protein remained in the native state. We then generated 44 
independent initial conditions, by running a 50 ps MD at 1600 K, starting from the energy min- 
imized native state, followed by a 100 ps relaxation at 300 K. The time step employed in all the 
simulations was 1 fs. From each of the 24 starting configurations we performed 96 independent 
runs, each consisting of 50000 rMD steps. For each of the remaining 20 initial conditions, the 
number of trial paths was limited to 48. 

In our simulations, we have chosen the ratchet spring constant kR = 0.02 kcal/mol, which 
represents a reasonable compromise between keeping a high computational efficiency for the path 
exploration and introducing a small bias. Indeed, using this value, the modulus of the biasing force 
is always at least one order of magnitude smaller than the norm of internal forces. 

We observed that 5 of the 44 initial conditions did not correspond to denatured states, hence they 
were rejected. In addition, in 13 of the remaining 39 sets, more than 80% of the trial trajectories 
did not reach the native state within the simulation time. In these cases the exploration of the 
path space was limited to very few trial trajectories, so the corresponding dominant paths were 
discarded. For the remaining 26 sets of paths, we identified the most probable by computing the 
OM action of Eq. |3j The complete set of atomistic simulations required less than two days of 
calculation on 48 CPU's. 
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It is important to study to what extent DRP results obtained this way depend on the value of 
the bias constant adopted in the rMD simulations. In Fig. [5] we plot the dominant reaction 
pathway obtained starting from the same initial condition, using different values of kn which span 
over almost two orders of magnitude. We see that in most simulations the folding occurs through 
the same qualitative pathway, in which the first hairpin forms before the second begins to fold. 
Only in one case — for a low value of the coupling constant kR — we find that the protein travels 
across the denatured state before taking a different pathway to the native state, in which the order of 
formation of the hairpins is reversed. Such a trajectory spends a much longer fraction of rMD steps 
in exploring the denatured state and initiates the transition from a very different configuration. 

It is interesting to note that the most probable path turns out not to be the one with the lowest 
ratchet constant. In particular, the trajectory taking the second pathway (labelled with in Fig. 
|5]) is among those with the lowest statistical weight. This is because the OM action tends to 
penalize paths which travel long Euclidean distances in configuration space. This result illustrates 
that choosing a low ratchet constant k^ produces an enhanced exploration of the denatured state, 
but does not necessarily lead to a more efficient identification of the most probable trajectories 
connecting given boundary condition at fixed times. 

Once a dominant path has been found, it is relatively straightforward to identify the configura- 
tion which belongs to the transition state ensemble. This can be done by finding the frame Xy5 in 
the trajectory such that the probability to reach the native state is equal to that of going back to the 
denatured configuration [ITSlI : 

Prob[Xr5^ Unfolded] e-^oM(Xrs^Xo) 

— = 1 (11) 

Prob [X 7-5 Native] e^^oMO^Ts^^N) 

In this equation Xa? and X/) are the first native and denatured configurations visited along the 
dominant path, starting from Xj-^. In order to locate them, we need to only take into account the 
"reactive" part of the path, that is the one which leaves the denatured state and, without recrossing, 
goes straight to the native. To satisfy this requirement, we considered the total RMSD versus 
frame index curve. The typical trend of this curve for most of the dominant trajectories is shown 
in the lower panel of Fig. [6j it consists in an initial plateau, followed by a rather steep fall, and 
then by another flat region, where the system oscillates in the native state. The reactive part of the 
path was identified with the region of steep fall in this curve. In particular, the beginning of the 
transition was set to the frame at which the derivative of the total RMSD curve changes sign, from 
positive to negative. 
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The simulations of the equilibrium properties of the WW-domain in the coarse-grained 
models were performed using a Monte Carlo algorithm based on a combination of Cartesian, 
crankshaft PTll and pivot [|48l moves. The free energy as a function of an arbitrary set of reaction 
coordinates (potential of mean force) was obtained from the frequency histogram calculated from 
long MC trajectories. 

n. RESULTS AND DISCUSSION 

In the upper panel of Fig j6] we show our set of atomistic dominant folding trajectories, projected 
onto the plane defined by the RMSD to the native structure of the Ca atoms in residues 8 — 23 
(hairpin 1) and 17 — 30 (hairpin 2). 

Two distinct folding pathways which differ by the order of formation of the hairpins can be 
clearly identified: in about half of the computed dominant folding pathways hairpin 1 consistently 
folds before hairpin 2. In this channel, we find the transition state is located at the "turn" of 
the paths, i.e. is formed by configurations in which the hairpin 1 is folded while hairpin 2 is 
largely unstructured (see pathway 1 in Fig. |2]). This is the mechanism predominantly found in the 
simulation of Ref. [fT2||. performed using the same force field, albeit in explicit solvent. 

In about half of the computed dominant paths, we observe that the two hairpins form in the 
reversed order. In this channel, the transition state is formed by the configurations in which hairpin 
2 is folded, while hairpin 1 is unstructured (see pathway 2 in Fig.|2]). 

An important point to emphasize is that this clean two-channel folding mechanism is a pre- 
diction of the DRP approach which could not be obtained if we analyzed directly the full set 
of (biased) trajectories obtained from the rMD algorithm, without scoring their probability. For 
example. Fig. |7] shows that not all the rMD trajectories computed starting from a given initial con- 
dition follow one of the two folding pathways discussed above. Indeed, many of them involve a 
simultaneous formation of native contacts in both hairpins. A clear prediction of the DRP formal- 
ism is that folding events in which the hairpins form simultaneously are much less frequent than 
those in which the two secondary structures forms in sequence. 

Another result emerging from our DRP calculation is the existence of a correlation between 
the structure of the initial conditions from which the transition is initiated and the pathway taken 
to fold: if at the beginning of the transition the first hairpin has a RMSD smaller than the second 
hairpin, then the first pathway is most likely chosen. In the opposite case, i.e. when the second 
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hairpin has a smaller RMSD to native than the first, then the second pathway is generally preferred. 

In order to further support these results and gain insight into the folding mechanism, we have 
performed simulations in an entirely different approach, i.e. by computing equilibrium properties 
using the coarse-grained models described in the Methods section. In Fig. |3] we show the free 
energy landscape at the 300 K, as a function of the RMSD to native of the two hairpins for the 
two models, which differ by the presence of non-native interactions. In both cases, we observe the 
existence of two valleys in the free energy landscape, which correspond to the two folding path- 
ways discussed above. Remarkably, the same structure for this free energy map was obtained by 
Ferrara and co-workers for the 20-residue peptide beta3s — which shares the same native topology 
of WW domains [f5T| — by means of equilibrium atomistic simulations based on the CHARMM 
force field, in implicit solvent. 

We emphasize that this set of qualitatively consistent results has been obtained using different 
algorithms (in equilibrium and nonequilibrium conditions) and different theoretical models (atom- 
istic and coarse-grained). The fact that models with and without non-native interactions give very 
similar free energy landscapes suggests that the structure of the two folding pathways of WW 
domains is mostly shaped by native interactions. 

A. Comparison with experimental data on folding kinetics 

In general, all experimental data on folding kinetics of WW domains indicate that the formation 
of the first hairpin is the main rate limiting step [EJ-Q. In particular, the (|)-values measured by 
Jager and co-workers display a clear peak in the region associated with hairpin 1, but significant 
(|)-values were reported also for residues in the sequence region relative to hairpin 2 Q. This fact 
indicates that the folding of the latter structure has some rate limiting effect. In addition, it was 
found that the (|)-values in the region of the second hairpin grow with temperature, while those in 
the region of the first hairpin decrease. This implies that, at higher and higher temperatures, the 
second hairpin plays an increasing role in the folding mechanism. 

An analysis based on (|)— values alone does not permit to fully characterize the folding mech- 
anism. In particular, it cannot distinguish between a single-channel folding mechanism in which 
native contacts in the two hairpins form simultaneously and a multiple-channel folding mecha- 
nism in which the reaction rate in each channel is limited by the folding of one of the hairpins. 
In Ref. [|49l Weikl has shown that the full body of existing (|)-value data (taken from Ref.s [[6113) 
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can be consistently and quantitatively explained by a simple kinetic model in which the folding of 
WW domains occurs through alternative channels, which correspond to the two pathways found 
in our DRP simulations. From a global fit of the experimental data, the author concluded that the 
relative probability of the first folding pathway for FBP and Pin 1 WW domain are 77% ± 5% and 
67 ± 5%, respectively. 

Let us now discuss the relative statistical weight of the two folding pathways. To this goal we 
need to estimate and compare the reaction rates in the two channels. The formalism for evaluating 
reaction rates in the DRP approach was developed in detail in Ref. lllTll . where it was shown that 
this method reproduces Kramers theory in the low-temperature regime. Applying that formalism, 
the ratio of the folding rates in the two channels reads 

^ ^ ^ e-P(Grsi-Grs2)^ (12) 



where the label 1 (2) identifies the channel in which hairpin 1 (2) folds first. In Eq. 12 the 
exponent contains the difference of the free-energies of the two transition states, defined from the 
dominant trajectories according to the commitment analysis described in the Methods section and 
in Ref. [|151 . In particular, one has 

e-mrs,) ^JdX .-P^W5 [(X - X',,) ■ n,,,] (/ = 1,2), (13) 

where X^^ is a point of the transition state which is visited by a typical dominant path in the 
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to 

are 



z— th reaction channel and nj^i is a versor tangent to the dominant path at X'^,^. Using Eq 
identify the transition states, we have found that the two partition functions defined in Eq. 13 
dominated by configurations in which one of the hairpin is fully formed while the other is still 
completely unstructured (see Fig. |6]). The average location of the computed transition states in the 
plane defined by the RMSD to native of the two hairpins is highlighted in the lower panel of Fig. 
|3]and can be identified with the endpoints of the two low free energy valleys. 



The coefficients and in the prefactor of Eq. 12 are defined in terms of quantities which 
can be calculated from the dominant paths — see Ref. lfTTl for details — . These terms estimate 
the average flux of reactive trajectories across the iso-committor dividing surface, including the 
contributions from small thermal fluctuations around the dominant paths. Unfortunately, evaluat- 
ing Kq and necessarily requires to perform the computationally expensive local optimization of 
the HJ action. However, if the reaction is thermally activated, the ratio of rates ki/k2 is mostly 
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controlled by the exponential contribution. Hence, we can consider the Arrhenius approximation 

S2). (14) 



^1 - -p (Grs, -Gts, 



We emphasize that in this approach the DRP information about the nonequilibrium reactive dy- 
namics is used to define the two transition states. On the other hand, the numerical value of the free 
energy difference may be obtained from equilibrium techniques, e.g. by sampling the integrals in 



Eq. 13 by means of computationally very expensive umbrella sampling or meta-dynamics 115211 
atomistic calculations. 

In this first exploratory application of the DRP formalism to a realistic protein folding reaction 
we choose to perform a much rougher estimate which relies on two main approximations. First, 



we identify the difference {GtSi " ^tSt) in Eq. 14 with the difference of the free energy in the two 



shaded regions of the energy landscape shown in Fig. [3} The centers of these regions represent the 
average location of the configurations in the two transition states TS 1 and TS2 obtained from DRP 
simulations, projected onto the plane selected by the RMSD to native of the two hairpins. The 
sizes of the shaded area represents the errors on the average location of the transition states on this 
plane, estimated from the standard deviation. The second assumption of our model is that such a 
free energy difference is driven by the balance between energy gain and entropy loss associated to 
the formation of native contacts in the two hairpins. This native-centric standpoint is supported in 
part by the fact that free energy landscapes computed in different models with and without non- 
native interactions are found to be very similar, as it is clear from comparing the panels in Fig. [3| 
Hence, to estimate GtSx ~ ^tSi we used the Go-type model described in the Methods section. 
It is important to emphasize that we are not computing the rate directly from a transition state 
theory formulated in the coarse-grained model, but we are using it only to estimate a free energy 
difference. 

This way, we obtained an estimate k\/k2 — 2.3 which corresponds to a relative weight of the 
first folding channel of 70% and 30%. We stress that, such a simple calculation should be consid- 
ered only a rough estimate. It indicates that the two channels have more or less comparable weight 
and that the first channel is the most probable, in qualitative agreement with experimental results 
and with the simulations of Shaw and co-workers. 

This simple scheme enables us to address the question of the dependence of the relative weights 
of the two channels on the temperature. Repeating the calculation at a higher temperature of 
380 K — assuming that the structure of the transition states is not significantly modified — we 
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find k\/k2 ^ 1.6, which corresponds to a branching ratio of channel 1 of about 60%. Hence, the 
rate limiting role of the second channel grows with temperature, in qualitative agreement with 
experimental kinetic data. 

This fact can be understood as follows. The folding of one of the hairpins generates an entropy 
loss proportional to the number n of native contacts formed. More precisely, assuming that the 
protein in the unfolded state is completely denatured then the entropy loss in the TS of the first 
(second) channel is ^ts\[ts2) = {^2)s, where s is the entropy loss for locking one residue in its 
native conformation, while ni = 15 and ^2 = 13 are the numbers of residues in the two hairpins of 
Fip35. The transition state in the first channel involves forming a longer hairpin, hence reaching it 
involves a larger entropy loss (but also larger gain of native energy). The role of the entropy loss 
relative to the energy gain in forming the hairpins grows with temperature, hence disfavoring the 
first folding charmel relative to the second. 

III. CONCLUSIONS 

In this work we have studied the folding mechanism of the WW domain Fip35 in equilib- 
rium and nonequilibrium conditions, using an atomistically detailed force field and coarse-grained 
models based on knowledge-based potentials. 

The obtained folding trajectories are not heterogeneous, but rather suggest that the folding 
proceeds through two dominant chaimels, defined by a hierarchical order of hairpin formation. In 
the most probable channel at room temperature, the first hairpin is almost fully formed before the 
second hairpin starts to fold. In the alternative route, the order of hairpin formation is reversed. 
Our simulations also suggest that the choice of the folding pathway is correlated with the structure 
of the denatured configuration from which the peptide initiates the folding reaction. Our results 
are compatible with both Weikl's analysis of kinetic data and Krivov's analysis of equilibrium MD 
simulations. 

The most important result of this work is to show that, using the DRP approach, it is possible to 
characterize at least the main qualitative aspects of the folding mechanism at an extremely modest 
computational cost, in the range of few hundreds of CPU hours. Such a level of computational 

efficiency opens the door to the investigation of the folding pathways of a large number of single- 
domain proteins, with sizes significantly larger than that of the small domain studied in the present 
work. 
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FIG. 2: Schematic representation of the structure of the two folding pathways obtained in our DRP 
simulations. 
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FIG. 3: The free energy at T = 300i^ as a function of the RMSD to native of the two hairpins, obtained 
from the Monte Carlo simulations in two coarse-grained models, described in the Method section. In 
the upper panel the model accounts for both native and non-native interaction, in the lower panel the 
model contains only native interactions. In the insert of the upper panel, we show the corresponding 
plot of the specific heat. Both models predict the existence of two distinct folding paths. The two 
shaded regions in the lower panel, with coordinates in the range (3.5 it 0.5 A, 7 it 0.5 A) and (9ib 0.5 
A, = 3.0 lb 0.5 A) are identified with the average location of the transition states obtained with DRP 
simulations. The average of the free energy in these region are used to estimate the relative rate of the 
two reaction channels. 
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FIG. 4: The effect of the minimization of the HJ functional |6] on a trial path obtained from the rMD 
algorithm on the RMSD to native of the Ca atoms in the backbone. This second refinement step does 
not change the qualitative structure of the dynamics. 



12 



10 



(N 

"S. 
Q 




kO = 0.0023 




kl = 0.0092 


X 


k2 = 0. 


016 




k3 = 0. 


023 


□ 


k4 = 0. 


030 




k5 = 0. 


037 


O 


k6 = 0. 


044 


• 


k7 = 0. 


051 


A 


k8 = 0. 


058 




k9 = 0. 


065 




klO = 0. 


071 


▼ 


kll = 0. 


078 




kl2 = 0. 


085 




kl3 = 0. 


092 




kl4 = 0. 


099 


• 


kl5 = 


. 11 





kl6 = 


. 12 


o 


kl7 = 


. 13 


o 


kl8 = 


. 14 




kl9 = 


.15 


o 



RMSD hairpin 1 (A) 



FIG. 5 : Dominant pathways obtained with different values of the ratchet force, starting from the same 
initial condition. The values of kn are given in units of kcal/mol. 
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FIG. 6: Upper panel: the set of dominant folding paths for Fip35, obtained from atomistic DRP 

simulations, projected on the plane defined by the RMSD of the two hairpins to the corresponding 

native structures. The dark spots represent a few typical configurations in the two transition states, 

evaluated by the requesting a probability 1/2 to reach the native state. Lower panel: evolution of 

the RMSD to native of the full protein and of the hairpins along a dominant trajectory. The shaded 

area is the reactive region and the dashed line identifies the transition state configuration obtained 
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according to Eq. [11} 
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FIG. 7: Upper panel: The set of trial paths connecting a given denatured configuration to the native 
state, used in the search for the dominant paths, projected on the plane defined by the RMSD to native 
of the two hairpins. The darker path is the selected dominant reaction pathways. Lower panel: the 
evolution of the biasing coordinate, in two typical folding trajectories of this set. 
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